Goto

Collaborating Authors

 language representation model


Semantic similarity estimation for domain specific data using BERT and other techniques

Prashanth, R.

arXiv.org Artificial Intelligence

Estimation of semantic similarity is an important research problem both in natural language processing and the natural language understanding, and that has tremendous application on various downstream tasks such as question answering, semantic search, information retrieval, document clustering, word-sense disambiguation and machine translation. In this work, we carry out the estimation of semantic similarity using different state-of-the-art techniques including the USE (Universal Sentence Encoder), InferSent and the most recent BERT, or Bidirectional Encoder Representations from Transformers, models. We use two question pairs datasets for the analysis, one is a domain specific in-house dataset and the other is a public dataset which is the Quora's question pairs dataset. We observe that the BERT model gave much superior performance as compared to the other methods. This should be because of the fine-tuning procedure that is involved in its training process, allowing it to learn patterns based on the training data that is used. This works demonstrates the applicability of BERT on domain specific datasets. We infer from the analysis that BERT is the best technique to use in the case of domain specific data.


PERCORE: A Deep Learning-Based Framework for Persian Spelling Correction with Phonetic Analysis

Dashti, Seyed Mohammad Sadegh, Bardsiri, Amid Khatibi, Shahbazzadeh, Mehdi Jafari

arXiv.org Artificial Intelligence

This research introduces a state-of-the-art Persian spelling correction system that seamlessly integrates deep learning techniques with phonetic analysis, significantly enhancing the accuracy and efficiency of natural language processing (NLP) for Persian. Utilizing a fine-tuned language representation model, our methodology effectively combines deep contextual analysis with phonetic insights, adeptly correcting both non-word and real-word spelling errors. This strategy proves particularly effective in tackling the unique complexities of Persian spelling, including its elaborate morphology and the challenge of homophony. A thorough evaluation on a wide-ranging dataset confirms our system's superior performance compared to existing methods, with impressive F1-Scores of 0.890 for detecting real-word errors and 0.905 for correcting them. Additionally, the system demonstrates a strong capability in non-word error correction, achieving an F1-Score of 0.891. These results illustrate the significant benefits of incorporating phonetic insights into deep learning models for spelling correction. Our contributions not only advance Persian language processing by providing a versatile solution for a variety of NLP applications but also pave the way for future research in the field, emphasizing the critical role of phonetic analysis in developing effective spelling correction system.


Do Not Harm Protected Groups in Debiasing Language Representation Models

Zhu, Chloe Qinyu, Stureborg, Rickard, Fain, Brandon

arXiv.org Artificial Intelligence

Language Representation Models (LRMs) trained with real-world data may capture and exacerbate undesired bias and cause unfair treatment of people in various demographic groups. Several techniques have been investigated for applying interventions to LRMs to remove bias in benchmark evaluations on, for example, word embeddings. However, the negative side effects of debiasing interventions are usually not revealed in the downstream tasks. We propose xGAP-DEBIAS, a set of evaluations on assessing the fairness of debiasing. In this work, We examine four debiasing techniques on a real-world text classification task and show that reducing biasing is at the cost of degrading performance for all demographic groups, including those the debiasing techniques aim to protect. We advocate that a debiasing technique should have good downstream performance with the constraint of ensuring no harm to the protected group.


MLRIP: Pre-training a military language representation model with informative factual knowledge and professional knowledge base

Li, Hui, Yang, Xuekang, Zhao, Xin, Yu, Lin, Zheng, Jiping, Sun, Wei

arXiv.org Artificial Intelligence

Incorporating prior knowledge into pre-trained language models has proven to be effective for knowledge-driven NLP tasks, such as entity typing and relation extraction. Current pre-training procedures usually inject external knowledge into models by using knowledge masking, knowledge fusion and knowledge replacement. However, factual information contained in the input sentences have not been fully mined, and the external knowledge for injecting have not been strictly checked. As a result, the context information cannot be fully exploited and extra noise will be introduced or the amount of knowledge injected is limited. To address these issues, we propose MLRIP, which modifies the knowledge masking strategies proposed by ERNIE-Baidu, and introduce a two-stage entity replacement strategy. Extensive experiments with comprehensive analyses illustrate the superiority of MLRIP over BERT-based models in military knowledge-driven NLP tasks.


Word2vec vs BERT

#artificialintelligence

Both word2vec and BERT are recent popular methods in NLP which are used for generating vector representation of words. Essentially replacing the use of word index dictionaries and one hot encoded vectors to represent text. Both word-index and one hot encoding methods do not capture the semantic sense of language. Also, one hot encoding becomes computationally infeasible if the size of vocabulary is LARGE. Word2vec [1] is a neural network approach to learn distributed word vectors in a way that words used in similar syntactic or semantic context, lie closer to each other in the distributed vector space.


10 Must-read AI Papers

#artificialintelligence

We have put together a list of 10 most cited and discussed research papers in machine learning that published over the past 10 years, from AlexNet to GPT-3. These are great readings for researchers new to this field and freshers for experienced researchers. For each paper, we provide links to the short overview, author presentations and detailed paper walkthrough for readers with different levels of expertise. Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art.


BERTSurv: BERT-Based Survival Models for Predicting Outcomes of Trauma Patients

Zhao, Yun, Hong, Qinghang, Zhang, Xinlu, Deng, Yu, Wang, Yuqing, Petzold, Linda

arXiv.org Artificial Intelligence

Survival analysis is a technique to predict the times of specific outcomes, and is widely used in predicting the outcomes for intensive care unit (ICU) trauma patients. Recently, deep learning models have drawn increasing attention in healthcare. However, there is a lack of deep learning methods that can model the relationship between measurements, clinical notes and mortality outcomes. In this paper we introduce BERTSurv, a deep learning survival framework which applies Bidirectional Encoder Representations from Transformers (BERT) as a language representation model on unstructured clinical notes, for mortality prediction and survival analysis. We also incorporate clinical measurements in BERTSurv. With binary cross-entropy (BCE) loss, BERTSurv can predict mortality as a binary outcome (mortality prediction). With partial log-likelihood (PLL) loss, BERTSurv predicts the probability of mortality as a time-to-event outcome (survival analysis). We apply BERTSurv on Medical Information Mart for Intensive Care III (MIMIC III) trauma patient data. For mortality prediction, BERTSurv obtained an area under the curve of receiver operating characteristic curve (AUC-ROC) of 0.86, which is an improvement of 3.6% over baseline of multilayer perceptron (MLP) without notes. For survival analysis, BERTSurv achieved a concordance index (C-index) of 0.7. In addition, visualizations of BERT's attention heads help to extract patterns in clinical notes and improve model interpretability by showing how the model assigns weights to different inputs.


microsoft/AzureML-BERT

#artificialintelligence

This repo contains end-to-end recipes to pretrain and finetune the BERT (Bidirectional Encoder Representations from Transformers) language representation model using Azure Machine Learning service. That implementation uses ONNX Runtime to accelerate training and it can be used in environments with GPU including Azure Machine Learning service. Details on using ONNX Runtime for training and accelerating training of Transformer models like BERT and GPT-2 are available in the blog at ONNX Runtime Training Technical Deep Dive. BERT is a language representation model that is distinguished by its capacity to effectively capture deep and subtle textual relationships in a corpus. In the original paper, the authors demonstrate that the BERT model could be easily adapted to build state-of-the-art models for a number of NLP tasks, including text classification, named entity recognition and question answering.


Data Science Papers for Spring 2020

#artificialintelligence

Pain Points, Needs, and Design Opportunities This paper is a study done on the usage of notebooks for data science. It cover a bunch of the negative impacts of using notebooks for data science. Deployment, setup, collaboration, and reliablity are a few of the examples. Quantifying the Carbon Emissions of Machine Learning Training a neural network can take a lot of computer processing power. This processing power comes at a cost to the environment.


Researchers spot origins of stereotyping in AI language technologies

#artificialintelligence

A team of researchers has identified a set of cultural stereotypes that are introduced into artificial intelligence models for language early in their development--a finding that adds to our understanding of the factors that influence results yielded by search engines and other AI-driven tools. "Our work identifies stereotypes about people that widely used AI language models pick up as they learn English. The models we're looking at, and others like them for other languages, are the building blocks of most modern language technologies, from translation systems to question-answering personal assistants to industry tools for resume screening, highlighting the real danger posed by the use of these technologies in their current state," says Sam Bowman, an assistant professor at NYU's Department of Linguistics and Center for Data Science and the paper's senior author. "We expect this effort and related projects will encourage future research towards building more fair language processing systems." The work dovetails with recent scholarship, such as Safiya Umoja Noble's "Algorithms of Oppression: How Search Engines Reinforce Racism" (NYU Press, 2018), which chronicles how racial and other biases have plagued widely used language technologies.